[doc] refactor: split technique guidance into docs, skills, and knowledge by yaoyu-33 · Pull Request #2809 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-03-15T22:23:02Z

Summary

move technique content to a clearer three-layer model: stable docs in docs/training, operational guides in skills/perf-techniques, and machine-readable cards in knowledge/techniques
rewrite the training docs to keep stable guidance there while moving enablement details, code anchors, and verification commands out of the docs pages
add missing knowledge cards and a dedicated hybrid context parallel doc so the technique set no longer depends on duplicated validated-knowledge pages

Test plan

Run pre-commit run --files on the staged docs, skills, and knowledge files
Verify the resulting staged tree is docs-only and does not include the roadmap proposal

Made with Cursor

Summary by CodeRabbit

Documentation
- Added strategy selection guide for choosing parallelism approaches based on model type, hardware topology, and sequence length.
- Introduced comprehensive guides for CUDA graphs optimization, hybrid context parallelism, and Megatron FSDP.
- Simplified and reorganized communication overlap and packed sequences documentation with clearer usage guidelines.
- Consolidated resiliency feature documentation with stable constraints and implementation guidance.
- Established structured technique documentation with operational guides and validation cards for performance optimization strategies.

Clarify which training guidance is stable documentation versus operational enablement so technique knowledge can be maintained without duplicated docs. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-15T22:23:06Z

/ok to test e397f32

copy-pr-bot · 2026-03-15T22:23:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add docs/skills/knowledge split for parallelism strategy selection, based on real recipe layouts from DeepSeek-V2/V3, Qwen3, Kimi-K2, etc. Key insight: MoE models size TP by active params (often TP=1-2) and use EP as the primary scaling dimension, unlike dense models. Verified TP=2+PP=2+SP on cluster with llama32_1b_pretrain_config. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

Move perf-technique skills from flat .md files to subfolders with SKILL.md + card.yaml, consolidating knowledge/techniques/ YAML cards alongside their skills. Add resiliency skill and card. Refactor docs/training/resiliency.md to be concise conceptual docs, offloading operational details to skills/resiliency/SKILL.md. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

Add stable docs page, operational SKILL.md, and machine-readable card.yaml for CUDA graph capture in Megatron Bridge. Update performance-guide.md to reference current cuda_graph_impl/scope config fields instead of the deprecated enable_cuda_graph. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-17T21:17:06Z

/ok to test 5d8044c

coderabbitai · 2026-03-17T21:44:42Z

📝 Walkthrough

Walkthrough

Documentation restructuring and expansion across training guides and knowledge organization. Consolidates verbose operational guidance into stable, concise overviews in docs/training/ while establishing a new skills/perf-techniques/ directory with structured SKILL.md and card.yaml pairs for performance techniques and resiliency features.

Changes

Cohort / File(s)	Summary
Training Documentation Rewrites `docs/training/communication-overlap.md`, `docs/training/megatron-fsdp.md`, `docs/training/packed-sequences.md`, `docs/training/resiliency.md`	Replaces granular, mode-specific guidance and extensive code examples with high-level overviews, stable constraints, and "When to Use It" sections. Removes detailed configuration snippets and procedural walkthroughs; consolidates into streamlined feature descriptions and references to related documentation.
New Training Documentation `docs/training/cuda-graphs.md`, `docs/training/hybrid-context-parallel.md`	Introduces new stable documentation pages covering CUDA graphs support (overview, implementations, configuration, constraints) and hierarchical context parallelism (Bridge-facing a2a+p2p transport path and hierarchical_context_parallel_sizes).
Documentation Index Updates `docs/parallelisms.md`, `docs/performance-guide.md`, `docs/training/README.md`	Adds parallelism strategy selection guide, updates CUDA graph references, and inserts new table entries for Hybrid Context Parallel and Megatron FSDP features.
Skills Infrastructure `knowledge/techniques/README.md`, `skills/perf-techniques/README.md`	Establishes skills directory structure documentation. Technique knowledge cards moved to co-locate with operational guides in skills/perf-techniques// as card.yaml, with stable human-facing docs remaining in docs/training/.
Performance Technique Skills `skills/perf-techniques/cuda-graphs/...`, `skills/perf-techniques/expert-parallel-overlap/...`, `skills/perf-techniques/hybrid-context-parallel/...`, `skills/perf-techniques/megatron-fsdp/...`, `skills/perf-techniques/parallelism-strategies/...`, `skills/perf-techniques/sequence-packing/...`, `skills/perf-techniques/tp-dp-comm-overlap/...`	Introduces seven new technique skill directories, each containing SKILL.md (operational guide with enablement steps, code anchors, pitfalls, verification procedures) and card.yaml (structured metadata with validation status, feature meanings, constraints, and evidence references).
Resiliency Skills `skills/resiliency/SKILL.md`, `skills/resiliency/card.yaml`	Adds comprehensive resiliency documentation covering fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine features, with configuration details, code anchors, pitfalls, and verification guidance; complemented by structured card.yaml with validation status and constraints.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title '[doc] refactor: split technique guidance into docs, skills, and knowledge' accurately and concisely describes the main refactoring objective—reorganizing technique guidance into a three-layer model (docs, skills, knowledge) as confirmed by commit messages and file structure changes throughout the changeset.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Test Results For Major Changes	✅ Passed	PR contains only documentation (.md) and configuration (.yaml) files with no code changes affecting numerics, convergence, or performance.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch yuya/public-perf-techniques-knowledge

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 19

🧹 Nitpick comments (3)

skills/perf-techniques/cuda-graphs/SKILL.md (1)

217-221: Consider hyphenating compound modifier.

Static analysis suggests "TE-scoped" for clarity when used as a compound adjective before a noun.

📝 Optional fix

 4. **Memory overhead**: CUDA graphs pin all intermediate buffers for the
-   graph's lifetime (no memory reuse). TE scoped graphs add a few GB;
+   graph's lifetime (no memory reuse). TE-scoped graphs add a few GB;
    full-iteration graphs can increase peak memory by 1.5–2×. `PP > 1`
    compounds overhead since each stage holds its own graph.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/cuda-graphs/SKILL.md` around lines 217 - 221, Change
the compound modifier "TE scoped graphs" to a hyphenated form "TE-scoped graphs"
in the paragraph discussing memory overhead (the line that currently reads "TE
scoped graphs add a few GB;"), so the compound adjective is clear when used
before the noun; update the occurrence(s) of "TE scoped" to "TE-scoped" to
maintain consistent style.

skills/resiliency/SKILL.md (1)

263-265: Consider more concise phrasing (optional).

Minor style suggestion from static analysis: "is not compatible with" could be "is incompatible with" for conciseness.

📝 Optional fix

-3. **IPR + NeMo-Run**: In-process restart is not compatible with NeMo-Run
+3. **IPR + NeMo-Run**: In-process restart is incompatible with NeMo-Run
    or Slurm preemption plugins. Requires specific PyTorch/NCCL versions
    and env vars.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/resiliency/SKILL.md` around lines 263 - 265, Replace the phrase "is
not compatible with" with the more concise "is incompatible with" in the bullet
that begins "IPR + NeMo-Run" so the line reads "IPR + NeMo-Run: In-process
restart is incompatible with NeMo-Run or Slurm preemption plugins." Keep the
rest of the sentence (the note about specific PyTorch/NCCL versions and env
vars) unchanged; update only the quoted wording in the SKILL.md bullet under
"IPR + NeMo-Run".

skills/perf-techniques/sequence-packing/card.yaml (1)

39-50: Avoid pseudo-values in this machine-readable block.

match_seq_length, ">1", and 2 * context_parallel_size are prose placeholders inside scalar YAML values. That makes recommended_path hard for tooling to consume consistently and undercuts the machine-readable-card goal. Consider moving these into explicit constraint fields or structured metadata instead.

As per coding guidelines, "Do not add arbitrary defaults for configs, be as explicit as possible".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@skills/perf-techniques/sequence-packing/card.yaml` around lines 39 - 50, The
YAML uses prose placeholders inside scalar values (e.g.,
recommended_path.llm_packed_sft.dataset.packed_sequence_specs.packed_sequence_size:
match_seq_length, recommended_path.vlm_in_batch_packing.train.micro_batch_size:
">1", and
recommended_path.cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult: 2
* context_parallel_size); replace these pseudo-values with explicit,
machine-readable fields or structured constraints: for example add a constraint
object like packed_sequence_specs.packed_sequence_size_mode: "match_seq_length"
or packed_sequence_specs.packed_sequence_size_constraint: {mode:
"match_seq_length"}, replace the string range with a min/max field such as
vlm_in_batch_packing.train.micro_batch_size_min: 2, and replace arithmetic with
a factor field such as
cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult_factor: 2 plus a
reference name context_parallel_size_variable so tooling can compute
pad_seq_to_mult at runtime—update llm_packed_sft, cp_finetuning, and
vlm_in_batch_packing accordingly so all values are explicit and
machine-readable.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/parallelisms.md`:
- Around line 508-512: Update the broken cross-reference in
docs/parallelisms.md: replace the non-existent
../knowledge/techniques/parallelism_strategies.yaml link with the relocated card
path under skills/perf-techniques (e.g.,
../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge
card" link in the "Related Artifacts" section points to the correct card.yaml;
edit the link text that currently references parallelism_strategies.yaml and
ensure it matches the directory name used by the existing skill file
(parallelism-strategies) and the card.yaml filename.

In `@docs/training/communication-overlap.md`:
- Around line 12-13: Update the stale skill file references that point to
pre-split markdown files by changing them to the new per-technique SKILL.md
layout: replace `skills/perf-techniques/tp-dp-comm-overlap.md` with
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace
`skills/perf-techniques/expert-parallel-overlap.md` with
`skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same
path updates to the other occurrences noted (101-102) so all references point to
the new SKILL.md locations.

In `@docs/training/cuda-graphs.md`:
- Around line 45-56: The docs use two different RNG-tracker
names—cfg.rng.te_rng_tracker in the example and use_te_rng_tracker in the
constraints—so pick one and make them consistent; update the constraint line to
refer to the same symbol used in the example (cfg.rng.te_rng_tracker) or
conversely change the example to use use_te_rng_tracker, and ensure the
constraint referencing cuda_graph_impl still reads "cfg.rng.te_rng_tracker must
be True when cfg.model.cuda_graph_impl != 'none'".
- Line 1: The new document docs/training/cuda-graphs.md is orphaned; update the
relevant toctree by adding a reference to "training/cuda-graphs.md" in the docs
index—specifically edit docs/index.md (or the appropriate docs/training index)
and insert the relative path or title entry into the toctree/list so the page is
included in the site navigation; ensure the entry matches the existing toctree
style and run a local docs build to verify the warning is resolved.

In `@docs/training/hybrid-context-parallel.md`:
- Around line 7-8: Update the opening pointer string that currently references
"skills/perf-techniques/hybrid-context-parallel.md" to use the new layout path
"skills/perf-techniques/hybrid-context-parallel/SKILL.md" so the link matches
the rest of the file; search for the literal reference
"skills/perf-techniques/hybrid-context-parallel.md" and replace it with
"skills/perf-techniques/hybrid-context-parallel/SKILL.md".
- Around line 1-8: The new doc "Hybrid / Hierarchical Context Parallel"
(hybrid-context-parallel.md) is not listed in any Sphinx toctree and will be
orphaned; update the relevant toctree (e.g., add an entry in docs/index.md or
the training toctree that aggregates docs/training/*.md) to include
"training/hybrid-context-parallel.md" so the page is reachable in the built docs
and the toc.not_included warning is resolved.

In `@docs/training/megatron-fsdp.md`:
- Line 10: The docs reference the old path
"skills/perf-techniques/megatron-fsdp.md" which is stale; update all internal
links in docs/training/megatron-fsdp.md that point to
"skills/perf-techniques/megatron-fsdp.md" (including the occurrences around the
top and near line 100) to the new path
"skills/perf-techniques/megatron-fsdp/SKILL.md" so internal navigation works;
ensure you replace every exact match of the old filename string with the new
"SKILL.md" path.

In `@docs/training/packed-sequences.md`:
- Around line 8-10: The docs reference
'skills/perf-techniques/sequence-packing.md' which no longer exists; update
every occurrence (including the one at lines ~83-89) to point to the relocated
guide 'skills/perf-techniques/sequence-packing/SKILL.md' so links resolve
correctly; search the file for the string "sequence-packing.md" and replace it
with "sequence-packing/SKILL.md" in the content and any link markup (the
affected file is docs/training/packed-sequences.md).

In `@docs/training/resiliency.md`:
- Around line 15-24: The table entry for "Local checkpointing" is missing its
corresponding documentation sections; add a matching "Local checkpointing"
feature block with "When to Use It" and "Stable Constraints" (and any other
sections used for other features like motivation/usage) following the exact
structure and tone used for entries such as "Async checkpoint save" and
"Preemption", or if this feature should not be documented yet, remove the "Local
checkpointing" row from the table to keep the table consistent; locate
references to "Local checkpointing" in the doc and implement the new sections
with clear guidance on motivation, technical approach, architecture, usage
examples, and implementation details to satisfy the project documentation
requirements.

In `@skills/perf-techniques/expert-parallel-overlap/card.yaml`:
- Around line 27-33: Update the card's known_constraints to mirror the runtime
checks in src/megatron/bridge/training/comm_overlap.py: add a constraint that
moe_shared_expert_overlap must be disabled/false (referencing the
moe_shared_expert_overlap check), require torch version >= 2.6.0 (referencing
the torch version check), and enumerate the recompute-related constraints
enforced in the file (recompute_granularity, recompute_method,
recompute_num_layers) plus the mtp_num_layers constraint so configurations
failing those checks are rejected by the card; ensure each constraint references
the exact setting names used in the code (moe_shared_expert_overlap,
recompute_granularity, recompute_method, recompute_num_layers, mtp_num_layers,
torch version) and phrase them consistently with the existing items.

In `@skills/perf-techniques/expert-parallel-overlap/SKILL.md`:
- Around line 106-110: Update the "Success criteria" in SKILL.md to remove the
hard-coded "41 passed" count and instead require that the targeted test files
pass with zero failures; specifically replace the "41 passed" bullet with a
statement that tests under test_comm_overlap.py (MoE overlap and delayed-wgrad
validation) and test_deepep.py (DeepEP/HybridEP helper activation and GPU
gating) must pass with no failures, so the criteria track those files rather
than a total pytest number.
- Around line 41-49: Add the missing hard requirement that
moe_shared_expert_overlap = False to the "Required constraints" list because the
code asserts this when overlap_moe_expert_parallel_comm is enabled; update
SKILL.md's constraints block to include the exact constraint
"moe_shared_expert_overlap = False" alongside the other items and mention it
applies whenever overlap_moe_expert_parallel_comm is true so readers know it's
not optional.

In `@skills/perf-techniques/hybrid-context-parallel/card.yaml`:
- Around line 51-60: The evidence list in card.yaml references five files that
don't exist; either add the missing files (create
scripts/verify_hybrid_context_parallel.py and the four Megatron-LM stubs under
3rdparty/Megatron-LM/megatron/core and megatron/training) or remove those
entries from the evidence array to keep the card accurate; locate the evidence
key in skills/perf-techniques/hybrid-context-parallel/card.yaml and update it so
only existing files (src/megatron/bridge/training/initialize.py, config.py,
model_load_save.py, and docs/performance-guide.md) remain or add the missing
files with the expected verification script and Megatron modules.

In `@skills/perf-techniques/hybrid-context-parallel/SKILL.md`:
- Around line 94-103: The documentation references a missing verification script
scripts/verify_hybrid_context_parallel.py in SKILL.md which causes "No such
file" errors; either add the missing script or remove/update the verification
commands. Locate the SKILL.md section that runs the verification (the
CUDA_VISIBLE_DEVICES... uv run python -m torch.distributed.run commands) and (a)
if the verification logic exists elsewhere, point the commands to the correct
script/module name and update the examples, or (b) if the verification is
intended to be provided, implement scripts/verify_hybrid_context_parallel.py
matching the CLI flags (--mode mpu/decentralized) and expected behavior, or (c)
remove the verification subsection entirely; update the README references and
any mentions of verify_hybrid_context_parallel.py accordingly.

In `@skills/perf-techniques/megatron-fsdp/card.yaml`:
- Around line 3-6: The torch_fsdp2_runtime status in the card.yaml is
inconsistent with the documented failure in known_limitations; update the
torch_fsdp2_runtime field (in skills/perf-techniques/megatron-fsdp/card.yaml)
from "unclear" to an explicit status reflecting the runtime failure (e.g.,
"failed" or "runtime_failure") and ensure the value aligns with the
known_limitations note about the Torch FSDP2 TypeError from pg_collection so
readers see the runtime failure state consistently.

In `@skills/perf-techniques/megatron-fsdp/SKILL.md`:
- Around line 27-35: The example config block for Megatron FSDP is missing the
cfg.ddp.average_in_collective = False override and thus diverges from the
recommended minimal override and perf-harness overrides; update the snippet that
sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp,
cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format,
cfg.checkpoint.save, and cfg.checkpoint.load to also set
cfg.ddp.average_in_collective = False so the copy-paste example matches the
recommended override set.

In `@skills/perf-techniques/parallelism-strategies/SKILL.md`:
- Around line 207-208: The cross-reference path in the paragraph mentioning
`skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that
reference to the correct target
`skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the
existing SKILL.md file (edit the occurrence of
`skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file
and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).

In `@skills/perf-techniques/sequence-packing/SKILL.md`:
- Around line 39-44: The doc block must mention that enabling pad_cu_seqlens
requires packed-sequence metadata to be present; update the text to explicitly
state that when cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must
supply packed-sequence metadata for your dataset (or disable pad_cu_seqlens),
because the dataset code (see src.megatron.bridge.data.datasets.sft assertion in
the packed-sequence handling) will assert that metadata exists; reference
pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the
assertion in the sft dataset so users with custom packed datasets know to
generate/provide the required packed metadata.

In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml`:
- Around line 31-36: Add a new known_constraints entry stating that TP overlap
requires Transformer Engine to be present; specifically mention that TP overlap
is disabled when HAVE_TE is false (see the gating in comm_overlap where the code
checks elif not HAVE_TE), so include "TP overlap requires Transformer Engine
(HAVE_TE)". Keep the same YAML list style and place it alongside the existing TP
overlap constraints referencing tensor_model_parallel_size and
sequence_parallel.

---

Nitpick comments:
In `@skills/perf-techniques/cuda-graphs/SKILL.md`:
- Around line 217-221: Change the compound modifier "TE scoped graphs" to a
hyphenated form "TE-scoped graphs" in the paragraph discussing memory overhead
(the line that currently reads "TE scoped graphs add a few GB;"), so the
compound adjective is clear when used before the noun; update the occurrence(s)
of "TE scoped" to "TE-scoped" to maintain consistent style.

In `@skills/perf-techniques/sequence-packing/card.yaml`:
- Around line 39-50: The YAML uses prose placeholders inside scalar values
(e.g.,
recommended_path.llm_packed_sft.dataset.packed_sequence_specs.packed_sequence_size:
match_seq_length, recommended_path.vlm_in_batch_packing.train.micro_batch_size:
">1", and
recommended_path.cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult: 2
* context_parallel_size); replace these pseudo-values with explicit,
machine-readable fields or structured constraints: for example add a constraint
object like packed_sequence_specs.packed_sequence_size_mode: "match_seq_length"
or packed_sequence_specs.packed_sequence_size_constraint: {mode:
"match_seq_length"}, replace the string range with a min/max field such as
vlm_in_batch_packing.train.micro_batch_size_min: 2, and replace arithmetic with
a factor field such as
cp_finetuning.dataset.packed_sequence_specs.pad_seq_to_mult_factor: 2 plus a
reference name context_parallel_size_variable so tooling can compute
pad_seq_to_mult at runtime—update llm_packed_sft, cp_finetuning, and
vlm_in_batch_packing accordingly so all values are explicit and
machine-readable.

In `@skills/resiliency/SKILL.md`:
- Around line 263-265: Replace the phrase "is not compatible with" with the more
concise "is incompatible with" in the bullet that begins "IPR + NeMo-Run" so the
line reads "IPR + NeMo-Run: In-process restart is incompatible with NeMo-Run or
Slurm preemption plugins." Keep the rest of the sentence (the note about
specific PyTorch/NCCL versions and env vars) unchanged; update only the quoted
wording in the SKILL.md bullet under "IPR + NeMo-Run".

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d69fecaa-5a5e-4c83-8fbc-5de00f65b2bf

📥 Commits

Reviewing files that changed from the base of the PR and between bb5790f and 5d8044c.

📒 Files selected for processing (27)

docs/parallelisms.md
docs/performance-guide.md
docs/training/README.md
docs/training/communication-overlap.md
docs/training/cuda-graphs.md
docs/training/hybrid-context-parallel.md
docs/training/megatron-fsdp.md
docs/training/packed-sequences.md
docs/training/resiliency.md
knowledge/techniques/README.md
skills/perf-techniques/README.md
skills/perf-techniques/cuda-graphs/SKILL.md
skills/perf-techniques/cuda-graphs/card.yaml
skills/perf-techniques/expert-parallel-overlap/SKILL.md
skills/perf-techniques/expert-parallel-overlap/card.yaml
skills/perf-techniques/hybrid-context-parallel/SKILL.md
skills/perf-techniques/hybrid-context-parallel/card.yaml
skills/perf-techniques/megatron-fsdp/SKILL.md
skills/perf-techniques/megatron-fsdp/card.yaml
skills/perf-techniques/parallelism-strategies/SKILL.md
skills/perf-techniques/parallelism-strategies/card.yaml
skills/perf-techniques/sequence-packing/SKILL.md
skills/perf-techniques/sequence-packing/card.yaml
skills/perf-techniques/tp-dp-comm-overlap/SKILL.md
skills/perf-techniques/tp-dp-comm-overlap/card.yaml
skills/resiliency/SKILL.md
skills/resiliency/card.yaml

coderabbitai · 2026-03-17T21:44:46Z

docs/parallelisms.md

+## Related Artifacts
+
+- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification
+- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status
+


⚠️ Potential issue | 🟡 Minor

Fix broken cross-reference to knowledge card.

The pipeline failure indicates the path ../knowledge/techniques/parallelism_strategies.yaml does not exist. Based on the README at knowledge/techniques/README.md, cards have been relocated to skills/perf-techniques/<technique>/card.yaml. Update the reference to point to the correct location.

🔧 Proposed fix

## Related Artifacts - **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification -- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status +- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification

- **Knowledge card**: [knowledge/techniques/parallelism_strategies.yaml](../knowledge/techniques/parallelism_strategies.yaml) — structured metadata and validation status

## Related Artifacts

- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification

- **Knowledge card**: [skills/perf-techniques/parallelism-strategies/card.yaml](../skills/perf-techniques/parallelism-strategies/card.yaml) — structured metadata and validation status

🧰 Tools

🪛 GitHub Actions: Build docs

[warning] 511-511: Sphinx: cross-reference target not found: '../knowledge/techniques/parallelism_strategies.yaml' [myst.xref_missing]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/parallelisms.md` around lines 508 - 512, Update the broken cross-reference in docs/parallelisms.md: replace the non-existent ../knowledge/techniques/parallelism_strategies.yaml link with the relocated card path under skills/perf-techniques (e.g., ../skills/perf-techniques/parallelism-strategies/card.yaml) so the "Knowledge card" link in the "Related Artifacts" section points to the correct card.yaml; edit the link text that currently references parallelism_strategies.yaml and ensure it matches the directory name used by the existing skill file (parallelism-strategies) and the card.yaml filename.

coderabbitai · 2026-03-17T21:44:46Z

docs/training/communication-overlap.md

+- `skills/perf-techniques/tp-dp-comm-overlap.md`
+- `skills/perf-techniques/expert-parallel-overlap.md`


⚠️ Potential issue | 🟡 Minor

Update these skill references to the new .../SKILL.md layout.

These pointers still use the pre-split skills/perf-techniques/*.md paths, but this PR is moving skills into per-technique directories with SKILL.md/card.yaml pairs. At least skills/perf-techniques/expert-parallel-overlap/SKILL.md already follows that layout, so the current file references will be stale immediately.

📝 Suggested path fix

- `skills/perf-techniques/tp-dp-comm-overlap.md` - `skills/perf-techniques/expert-parallel-overlap.md` + `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` + `skills/perf-techniques/expert-parallel-overlap/SKILL.md` ... - `skills/perf-techniques/tp-dp-comm-overlap.md` - `skills/perf-techniques/expert-parallel-overlap.md` + `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` + `skills/perf-techniques/expert-parallel-overlap/SKILL.md`

Also applies to: 101-102

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/training/communication-overlap.md` around lines 12 - 13, Update the stale skill file references that point to pre-split markdown files by changing them to the new per-technique SKILL.md layout: replace `skills/perf-techniques/tp-dp-comm-overlap.md` with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` and replace `skills/perf-techniques/expert-parallel-overlap.md` with `skills/perf-techniques/expert-parallel-overlap/SKILL.md`; also apply the same path updates to the other occurrences noted (101-102) so all references point to the new SKILL.md locations.

coderabbitai · 2026-03-17T21:44:46Z

docs/training/cuda-graphs.md

@@ -0,0 +1,84 @@
+# CUDA Graphs


⚠️ Potential issue | 🟠 Major

Add this page to the docs toctree.

The docs build is already warning that this page is not included in any toctree, so the new document is currently orphaned and easy to miss. Please wire it into the relevant docs index/toctree before merging.

As per coding guidelines, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/training/cuda-graphs.md` at line 1, The new document docs/training/cuda-graphs.md is orphaned; update the relevant toctree by adding a reference to "training/cuda-graphs.md" in the docs index—specifically edit docs/index.md (or the appropriate docs/training index) and insert the relative path or title entry into the toctree/list so the page is included in the site navigation; ensure the entry matches the existing toctree style and run a local docs build to verify the warning is resolved.

coderabbitai · 2026-03-17T21:44:46Z

docs/training/cuda-graphs.md

+```python
+cfg.model.cuda_graph_impl = "transformer_engine"        # or "local"
+cfg.model.cuda_graph_scope = ["attn", "moe_router"]     # scope list
+cfg.model.cuda_graph_warmup_steps = 3                   # warmup before capture
+cfg.rng.te_rng_tracker = True                           # required
+```
+
+### Key constraints
+
+- `use_te_rng_tracker` must be `True` when `cuda_graph_impl != "none"`.
+- `full_iteration` scope requires `cuda_graph_impl = "local"` and
+  `rerun_state_machine.check_for_nan_in_loss = False`.


⚠️ Potential issue | 🟡 Minor

Use one RNG-tracker setting name here.

The example uses cfg.rng.te_rng_tracker, but the constraint list switches to use_te_rng_tracker. One of those names is wrong, so the setup instructions are currently self-contradictory.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/training/cuda-graphs.md` around lines 45 - 56, The docs use two different RNG-tracker names—cfg.rng.te_rng_tracker in the example and use_te_rng_tracker in the constraints—so pick one and make them consistent; update the constraint line to refer to the same symbol used in the example (cfg.rng.te_rng_tracker) or conversely change the example to use use_te_rng_tracker, and ensure the constraint referencing cuda_graph_impl still reads "cfg.rng.te_rng_tracker must be True when cfg.model.cuda_graph_impl != 'none'".

coderabbitai · 2026-03-17T21:44:46Z

docs/training/hybrid-context-parallel.md

+# Hybrid / Hierarchical Context Parallel
+
+This page covers the stable Bridge-facing meaning of hierarchical context
+parallelism, especially the `a2a+p2p` transport path and
+`hierarchical_context_parallel_sizes`.
+
+For operational setup, code anchors, and verification commands, see
+`skills/perf-techniques/hybrid-context-parallel.md`.


⚠️ Potential issue | 🟠 Major

Add this new page to a Sphinx toctree before merging.

The docs pipeline is already warning document isn't included in any toctree [toc.not_included], so this page will be orphaned in the built docs until docs/index.md or the relevant training toctree includes it. Based on learnings, "Ensure docs/index.md is updated when new markdown docs are added under docs/**/*.md or markdown files are renamed".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/training/hybrid-context-parallel.md` around lines 1 - 8, The new doc "Hybrid / Hierarchical Context Parallel" (hybrid-context-parallel.md) is not listed in any Sphinx toctree and will be orphaned; update the relevant toctree (e.g., add an entry in docs/index.md or the training toctree that aggregates docs/training/*.md) to include "training/hybrid-context-parallel.md" so the page is reachable in the built docs and the toc.not_included warning is resolved.

coderabbitai · 2026-03-17T21:44:47Z

skills/perf-techniques/megatron-fsdp/card.yaml

+summary: >
+  Megatron FSDP is the practical FSDP path in Megatron-Bridge today. PyTorch
+  FSDP2 exists in code, but remains experimental and failed at runtime during
+  live validation on the current branch.


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat skills/perf-techniques/megatron-fsdp/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2172

Update torch_fsdp2_runtime status to reflect the documented runtime failure.

The field is marked unclear but known_limitations documents a specific FSDP2 runtime failure: "Live validation of the current branch hit a Torch FSDP2 runtime TypeError from pg_collection." Change the status from unclear to explicitly reflect this failure.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/megatron-fsdp/card.yaml` around lines 3 - 6, The torch_fsdp2_runtime status in the card.yaml is inconsistent with the documented failure in known_limitations; update the torch_fsdp2_runtime field (in skills/perf-techniques/megatron-fsdp/card.yaml) from "unclear" to an explicit status reflecting the runtime failure (e.g., "failed" or "runtime_failure") and ensure the value aligns with the known_limitations note about the Torch FSDP2 TypeError from pg_collection so readers see the runtime failure state consistently.

coderabbitai · 2026-03-17T21:44:47Z

skills/perf-techniques/megatron-fsdp/SKILL.md

+```python
+cfg = llama3_8b_pretrain_config()
+cfg.dist.use_megatron_fsdp = True
+cfg.ddp.use_megatron_fsdp = True
+cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
+cfg.checkpoint.ckpt_format = "fsdp_dtensor"
+cfg.checkpoint.save = "/tmp/fsdp_ckpts"
+cfg.checkpoint.load = None
+```


⚠️ Potential issue | 🟡 Minor

Keep the copy-paste recipe aligned with the recommended override set.

The example on Lines 27-35 drops cfg.ddp.average_in_collective = False, even though the minimal override above and the perf-harness overrides on Lines 91-98 both include it. As written, users who copy the example end up testing a different configuration than the one this skill recommends.

Suggested fix

cfg = llama3_8b_pretrain_config() cfg.dist.use_megatron_fsdp = True cfg.ddp.use_megatron_fsdp = True cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params" +cfg.ddp.average_in_collective = False cfg.checkpoint.ckpt_format = "fsdp_dtensor" cfg.checkpoint.save = "/tmp/fsdp_ckpts" cfg.checkpoint.load = None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/megatron-fsdp/SKILL.md` around lines 27 - 35, The example config block for Megatron FSDP is missing the cfg.ddp.average_in_collective = False override and thus diverges from the recommended minimal override and perf-harness overrides; update the snippet that sets cfg.dist.use_megatron_fsdp, cfg.ddp.use_megatron_fsdp, cfg.ddp.data_parallel_sharding_strategy, cfg.checkpoint.ckpt_format, cfg.checkpoint.save, and cfg.checkpoint.load to also set cfg.ddp.average_in_collective = False so the copy-paste example matches the recommended override set.

coderabbitai · 2026-03-17T21:44:47Z

skills/perf-techniques/parallelism-strategies/SKILL.md

+7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
+   overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find the correct path for tp-dp-comm-overlap documentation fd -t f "SKILL.md" skills/perf-techniques/tp-dp-comm-overlap/ fd -t f "tp-dp-comm-overlap" skills/

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 121

Fix incorrect cross-reference path.

Line 208 references skills/perf-techniques/tp-dp-comm-overlap.md, but the file is located at skills/perf-techniques/tp-dp-comm-overlap/SKILL.md.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/parallelism-strategies/SKILL.md` around lines 207 - 208, The cross-reference path in the paragraph mentioning `skills/perf-techniques/tp-dp-comm-overlap.md` is incorrect; update that reference to the correct target `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md` so the link points to the existing SKILL.md file (edit the occurrence of `skills/perf-techniques/tp-dp-comm-overlap.md` in the current `SKILL.md` file and replace it with `skills/perf-techniques/tp-dp-comm-overlap/SKILL.md`).

coderabbitai · 2026-03-17T21:44:47Z

skills/perf-techniques/sequence-packing/SKILL.md

+If CUDA graphs are enabled for this packed path:
+
+```python
+cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
+cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
+```


⚠️ Potential issue | 🟡 Minor

Call out the metadata prerequisite for pad_cu_seqlens.

This block currently implies those two flags are sufficient, but src/megatron/bridge/data/datasets/sft.py:765-772 also asserts that packed-sequence metadata is present when pad_cu_seqlens=True. Without that note, users with custom packed datasets will hit an assertion after following this section.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/sequence-packing/SKILL.md` around lines 39 - 44, The doc block must mention that enabling pad_cu_seqlens requires packed-sequence metadata to be present; update the text to explicitly state that when cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True you must supply packed-sequence metadata for your dataset (or disable pad_cu_seqlens), because the dataset code (see src.megatron.bridge.data.datasets.sft assertion in the packed-sequence handling) will assert that metadata exists; reference pad_cu_seqlens, cfg.dataset.dataset_kwargs["pad_to_max_length"], and the assertion in the sft dataset so users with custom packed datasets know to generate/provide the required packed metadata.

coderabbitai · 2026-03-17T21:44:47Z

skills/perf-techniques/tp-dp-comm-overlap/card.yaml

+known_constraints:
+  - TP overlap requires tensor_model_parallel_size > 1.
+  - TP overlap requires sequence_parallel=True.
+  - DP overlap is tied to the distributed-optimizer path.
+  - PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup.
+  - Launch-time environment tuning is part of practical overlap behavior.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the comm_overlap.py file and check around lines 421-465 find . -type f -name "comm_overlap.py" | head -5

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 117

🏁 Script executed:

# Find the card.yaml file mentioned in the review find . -type f -path "*/tp-dp-comm-overlap/card.yaml"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 124

🏁 Script executed:

# Check if the line range 421-465 exists and what's in it if [ -f "src/megatron/bridge/training/comm_overlap.py" ]; then wc -l src/megatron/bridge/training/comm_overlap.py echo "=== Lines 421-465 ===" sed -n '421,465p' src/megatron/bridge/training/comm_overlap.py fi

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2311

🏁 Script executed:

# Search for Transformer Engine checks in the comm_overlap.py file rg "TransformerEngine|TE|transformer.engine" src/megatron/bridge/training/comm_overlap.py -A 3 -B 3 || echo "No matches found"

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2366

🏁 Script executed:

# Check the card.yaml file content to see the current known_constraints cat src/megatron/bridge/skills/perf-techniques/tp-dp-comm-overlap/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 174

🏁 Script executed:

# Use the correct path found earlier cat ./skills/perf-techniques/tp-dp-comm-overlap/card.yaml

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2113

Add the Transformer Engine prerequisite to known_constraints.

The card marks TP overlap gating as code-verified, but known_constraints omits a gate that the code enforces. In src/megatron/bridge/training/comm_overlap.py:444-446, TP overlap is disabled when Transformer Engine is unavailable (elif not HAVE_TE). This constraint should be documented alongside the tensor-parallel size and sequence-parallel requirements.

📝 Suggested addition

known_constraints: - TP overlap requires tensor_model_parallel_size > 1. - TP overlap requires sequence_parallel=True. + - TP overlap requires Transformer Engine to be available. - DP overlap is tied to the distributed-optimizer path.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

known_constraints:

- TP overlap requires tensor_model_parallel_size > 1.

- TP overlap requires sequence_parallel=True.

- DP overlap is tied to the distributed-optimizer path.

- PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup.

- Launch-time environment tuning is part of practical overlap behavior.

known_constraints:

- TP overlap requires tensor_model_parallel_size > 1.

- TP overlap requires sequence_parallel=True.

- TP overlap requires Transformer Engine to be available.

- DP overlap is tied to the distributed-optimizer path.

- PP overlap behavior depends on the pipeline schedule and is not identical for every PP setup.

- Launch-time environment tuning is part of practical overlap behavior.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/perf-techniques/tp-dp-comm-overlap/card.yaml` around lines 31 - 36, Add a new known_constraints entry stating that TP overlap requires Transformer Engine to be present; specifically mention that TP overlap is disabled when HAVE_TE is false (see the gating in comm_overlap where the code checks elif not HAVE_TE), so include "TP overlap requires Transformer Engine (HAVE_TE)". Keep the same YAML list style and place it alongside the existing TP overlap constraints referencing tensor_model_parallel_size and sequence_parallel.

…h HCP docs - Add skills/perf-techniques/moe-comm-overlap/ (SKILL.md + card.yaml) - Add skills/perf-techniques/packed-sequences-long-context/ (SKILL.md + card.yaml) - Enrich hybrid-context-parallel SKILL.md with implementation map and pitfalls - Move conceptual content (what/when) from skills into docs per source-of-truth split: - docs/training/hybrid-context-parallel.md: CP backends table, KV-head limit, decision tree - docs/training/communication-overlap.md: MoE dispatch/combine overlap details - docs/training/packed-sequences.md: three-path table (offline SFT, VLM, CP) - Note SP interaction with CP packing (pad_seq_to_mult = 2 * CP * TP) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-18T01:12:03Z

/ok to test 75c9995

yaoyu-33 · 2026-03-18T01:39:34Z

/claude review

claude · 2026-03-18T01:42:29Z

docs/parallelisms.md

+For operational details on configuring combined parallelism, troubleshooting
+layouts, and memory estimation, see the
+[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies.md).
+


Bug: This links to ../skills/perf-techniques/parallelism-strategies.md, but the actual file is skills/perf-techniques/parallelism-strategies/SKILL.md. The link will 404.

Suggested change

[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md).

claude · 2026-03-18T01:42:30Z

docs/parallelisms.md

+
+- **Operational skill**: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification


Two broken links here:

../skills/perf-techniques/parallelism-strategies.md should be ../skills/perf-techniques/parallelism-strategies/SKILL.md

../knowledge/techniques/parallelism_strategies.yaml — this file doesn't exist in the repo and isn't added by this PR

claude · 2026-03-18T01:42:32Z

skills/perf-techniques/hybrid-context-parallel/SKILL.md

+
+Verify the supported MPU path:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \


The verification section references scripts/verify_hybrid_context_parallel.py, but this script doesn't exist in the repo and isn't added by this PR. Either add the script or replace this with a verification method that uses existing files.

- Fix broken cross-references: update all stale *.md skill paths to the new */SKILL.md layout used throughout this PR - Add cuda-graphs.md and hybrid-context-parallel.md to docs/index.md toctree so they are no longer orphaned in the Sphinx build - Fix RNG tracker naming: use cfg.rng.te_rng_tracker consistently instead of mixing with use_te_rng_tracker in cuda-graphs.md - Add Local Checkpointing section to resiliency.md with When to Use It and Stable Constraints so the feature table entry is not incomplete - Add moe_shared_expert_overlap = False as a hard required constraint in expert-parallel-overlap/SKILL.md (it is asserted at runtime) - Replace hard-coded "41 passed" with "zero failures" wording in expert-parallel-overlap/SKILL.md so it does not drift - Add cfg.ddp.average_in_collective = False to FSDP example recipe to match the minimal override set documented above it - Update megatron-fsdp/card.yaml torch_fsdp2_runtime from unclear to known_failure to reflect the documented runtime TypeError - Replace non-existent scripts/verify_hybrid_context_parallel.py in hybrid-context-parallel/SKILL.md and card.yaml with existing test_decentralized_pg.py and a manual smoke-check recipe invocation - Add pad_cu_seqlens metadata prerequisite note in sequence-packing/SKILL.md (asserted in sft.py:765-772) - Fix cross-reference in parallelism-strategies/SKILL.md line 208 Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-03-18T01:49:56Z

/ok to test 4f6cca7

All skill and cross-doc references in docs/training/ were plain backtick literals. Convert them to [text](relative-path) hyperlinks so they are clickable in GitHub and rendered Sphinx docs. Covers communication-overlap.md, cuda-graphs.md, hybrid-context-parallel.md, megatron-fsdp.md, packed-sequences.md, and resiliency.md. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-03-18T01:54:05Z

/ok to test f543ff5

docs: separate technique guidance into docs, skills, and knowledge

e397f32

Clarify which training guidance is stable documentation versus operational enablement so technique knowledge can be maintained without duplicated docs. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 added docs Documentation-only updates or documentation debt area:misc Cross-cutting utilities, logging, helpers, and other changes docs-only With great power comes great responsibility. labels Mar 15, 2026

cuichenx previously approved these changes Mar 16, 2026

View reviewed changes

yaoyu-33 dismissed cuichenx’s stale review via 033eb68 March 16, 2026 21:00

yaoyu-33 marked this pull request as ready for review March 17, 2026 21:30

coderabbitai bot reviewed Mar 17, 2026

View reviewed changes

claude bot reviewed Mar 18, 2026

View reviewed changes

malay-nagda requested a review from erhoo82 March 18, 2026 07:24

		- `skills/perf-techniques/tp-dp-comm-overlap.md`
		- `skills/perf-techniques/expert-parallel-overlap.md`

		7. `CUDA_DEVICE_MAX_CONNECTIONS` and related env vars interact with
		overlap settings. See `skills/perf-techniques/tp-dp-comm-overlap.md`.


	[parallelism strategies skill](../skills/perf-techniques/parallelism-strategies/SKILL.md).


		- Operational skill: [skills/perf-techniques/parallelism-strategies.md](../skills/perf-techniques/parallelism-strategies.md) — enablement, pitfalls, memory estimation, verification

Conversation

yaoyu-33 commented Mar 15, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

yaoyu-33 commented Mar 15, 2026

Uh oh!

copy-pr-bot bot commented Mar 15, 2026

Uh oh!

yaoyu-33 commented Mar 17, 2026

Uh oh!

coderabbitai bot commented Mar 17, 2026

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 commented Mar 18, 2026

Uh oh!

yaoyu-33 commented Mar 18, 2026

Uh oh!

claude bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 commented Mar 18, 2026

Uh oh!

yaoyu-33 commented Mar 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yaoyu-33 commented Mar 15, 2026 •

edited by coderabbitai bot

Loading